ROCm and HIP: A Detailed 10-Chapter Tutorial: The Memory-Centric Nature of GPU Performance

In GPU acceleration, we must abandon the "compute first" mindset. Modern performance is dictated by memory management: the coordination of data allocation, synchronization, and transfer optimization between the host (CPU) and the device (GPU).

1. The Memory-Compute Gap

While GPU arithmetic throughput ($TFLOPS$) has grown exponentially, memory bandwidth ($GB/s$) has grown at a much slower pace. This creates a gap in which the execution units are often "starved", sitting idle while they wait for data to arrive from VRAM. Consequently, GPU programming is largely memory programming.
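To put a number on this gap, the sketch below computes how many FLOPs a GPU could execute in the time it takes to fetch one byte, and compares that with what a simple vector addition actually performs per byte. The peak figures (`PEAK_FLOPS`, `PEAK_BANDWIDTH`) are illustrative assumptions, not measurements of any particular AMD device, and the helper names are hypothetical.

```python
# Illustrative, assumed hardware figures; not measurements of any real GPU.
PEAK_FLOPS = 40e12        # 40 TFLOPS of FP32 arithmetic (assumption)
PEAK_BANDWIDTH = 1.6e12   # 1.6 TB/s of device memory bandwidth (assumption)


def machine_balance(peak_flops: float, peak_bandwidth: float) -> float:
    """FLOPs the GPU can execute in the time it takes to move one byte."""
    return peak_flops / peak_bandwidth


def vector_add_flops_per_byte(bytes_per_element: int = 4) -> float:
    """c[i] = a[i] + b[i]: 1 FLOP for every 2 loads and 1 store."""
    return 1.0 / (3 * bytes_per_element)


balance = machine_balance(PEAK_FLOPS, PEAK_BANDWIDTH)
supplied = vector_add_flops_per_byte()

print(f"FLOPs needed per byte to keep the units busy: {balance:.1f}")
print(f"FLOPs per byte in a float32 vector add:       {supplied:.3f}")
print(f"Fraction of peak compute usable:              {supplied / balance:.2%}")
```

Under these assumed figures, the vector add supplies far less work per byte than the hardware can absorb, so the arithmetic units spend most of their time waiting on memory.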

2. The Roofline Model

This model visualizes the relationship between arithmetic intensity (FLOPs/Byte) and attainable performance. Applications typically fall into one of two categories:

Memory-bound: limited by memory bandwidth (the sloped part of the roofline).
Compute-bound: limited by peak TFLOPS (the horizontal ceiling).
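The two categories above follow from a single expression: attainable performance is the smaller of the compute ceiling and arithmetic intensity multiplied by bandwidth. The following is a minimal sketch of that bound, again using assumed peak figures rather than real hardware data; the function names are hypothetical.

```python
# Assumed peak figures for illustration only.
PEAK_FLOPS = 40e12        # horizontal ceiling, FLOP/s (assumption)
PEAK_BANDWIDTH = 1.6e12   # bytes/s, sets the slope (assumption)


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     peak_bandwidth: float = PEAK_BANDWIDTH) -> float:
    """Roofline bound: min(compute ceiling, intensity * bandwidth)."""
    return min(peak_flops, intensity * peak_bandwidth)


def classify(intensity: float,
             peak_flops: float = PEAK_FLOPS,
             peak_bandwidth: float = PEAK_BANDWIDTH) -> str:
    """Kernels left of the ridge point are memory-bound, the rest compute-bound."""
    ridge_point = peak_flops / peak_bandwidth  # FLOPs/Byte where slope meets ceiling
    return "memory-bound" if intensity < ridge_point else "compute-bound"


# Hypothetical kernels, described only by their arithmetic intensity (FLOPs/Byte).
kernels = {"vector add (float32)": 1.0 / 12.0, "dense, heavily tiled kernel": 50.0}

for name, intensity in kernels.items():
    bound = attainable_flops(intensity) / 1e12
    print(f"{name}: {classify(intensity)}, at most {bound:.2f} TFLOPS")
```

The ridge point (peak FLOP/s divided by peak bandwidth) is the intensity at which a kernel stops being limited by memory and starts being limited by arithmetic.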

3. The Data Movement Tax

The main performance bottleneck is rarely the computation itself; it is the latency and energy cost of moving each byte across the PCIe bus or from HBM. High-performance code prioritizes data residency, keeping data on the device for as long as it is needed, and minimizes host-device transfers.
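A rough cost model makes the tax visible. The sketch below compares a pipeline that copies a buffer to the device and back around every kernel with one that keeps the buffer resident and copies only once each way. Both bandwidth values are assumptions (the PCIe figure is on the order of a PCIe 4.0 x16 link), latency is ignored, and each kernel is assumed to read and write the buffer exactly once.

```python
# Assumed bandwidths for illustration; real links rarely reach these peaks.
PCIE_BANDWIDTH = 32e9     # bytes/s between host and device (assumption)
HBM_BANDWIDTH = 1.6e12    # bytes/s between GPU and its own memory (assumption)


def transfer_seconds(num_bytes: float, bandwidth: float) -> float:
    """Time to move num_bytes at the given bandwidth, ignoring latency."""
    return num_bytes / bandwidth


def data_movement_seconds(num_bytes: float, num_kernels: int, resident: bool) -> float:
    """Total time spent moving data for num_kernels back-to-back kernels.

    resident=True:  one copy in, one copy out for the whole pipeline.
    resident=False: one copy in and one copy out around every kernel.
    """
    round_trips = 1 if resident else num_kernels
    pcie = 2 * round_trips * transfer_seconds(num_bytes, PCIE_BANDWIDTH)
    # Assumption: each kernel reads the buffer once and writes it once in HBM.
    hbm = 2 * num_kernels * transfer_seconds(num_bytes, HBM_BANDWIDTH)
    return pcie + hbm


buffer_bytes = 1e9   # 1 GB working set (hypothetical)
naive = data_movement_seconds(buffer_bytes, num_kernels=10, resident=False)
kept = data_movement_seconds(buffer_bytes, num_kernels=10, resident=True)

print(f"copy around every kernel: {naive:.4f} s")
print(f"data kept on the device:  {kept:.4f} s")
print(f"speedup from residency:   {naive / kept:.1f}x")
```

In HIP terms, the resident variant corresponds to allocating the buffer on the device once and launching all kernels against it before a single copy back, rather than copying between host and device around each launch.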

QUESTION 1

What is the primary cause of a GPU kernel being 'memory-bound'?

The clock speed of the GPU cores is too slow.

The rate of data delivery is slower than the rate of arithmetic execution.

There are too many threads running in parallel.

The CPU is faster than the GPU.

QUESTION 2

In the context of GPU programming, what does 'Memory Management' involve?

Only allocating variables on the CPU stack.

Controlling allocation, synchronization, and optimization of data transfer between host and device.

Optimizing the cache size of the L1 controller.

Manually cleaning the GPU registers after every kernel call.

QUESTION 3

Which axis of the Roofline Model represents 'Arithmetic Intensity'?

Vertical Axis (Y)

Horizontal Axis (X)

The slope of the line.

The area under the curve.

QUESTION 4

Why is redundant host-device transfer considered a 'performance tax'?

It consumes GPU registers.

The latency and energy cost of moving data across PCIe are significantly higher than those of executing instructions.

It increases the floating-point precision error.

It causes the GPU to overheat instantly.

QUESTION 5

If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?

The math instructions are too complex.

Inefficient orchestration of data residency, causing the GPU to wait for data.

The GPU has too much VRAM.

The kernel was written in C++ instead of Python.